Abstract

We provide a sound and consistent founda- 
tion for the use of nonrandom exploration 
data in "contextual bandit" or "partially la- 
beled" settings where only the value of a cho- 
sen action is learned. 

The primary challenge in a variety of settings 
is that the exploration policy, in which "of- 
fline" data is logged, is not explicitly known. 
Prior solutions here require either control 
of the actions during the learning process, 
recorded random exploration, or actions cho- 
sen obliviously in a repeated manner. The 
techniques reported here lift these restric- 
tions, allowing the learning of a policy for 
choosing actions given features from histori- 
cal data where no randomization occurred or 
was logged. 

We empirically verify our solution on a reasonably
sized set of real-world data obtained
from an online advertising company.



1. The Problem 

Consider the advertisement display problem, where a 
search engine company chooses an ad to display which 
is intended to interest the user. Revenue is typically 
provided to the search engine from the advertiser only 
when the user clicks on the displayed ad. This problem 
is of intrinsic economic interest, resulting in a substantial
fraction of income for several well-known companies
such as Google, Yahoo!, and Facebook. Furthermore,
existing trends imply this problem is of growing
importance.

Before discussing the approach we propose, it's impor- 
tant to formalize and generalize the problem, and then 
consider why more conventional approaches can fail. 

The warm start problem for contextual 
exploration 

Let $X$ be an arbitrary input space, and $A = \{1, \ldots, k\}$
be a set of actions. An instance of the contextual bandit
problem is specified by a distribution $D$ over tuples
$(x, r)$ where $x \in X$ is an input and $r \in [0,1]^k$ is a
vector of rewards (Langford & Zhang, 2008). Events
occur on a round-by-round basis, where on each round
$t$:

1. The world draws $(x, r) \sim D$ and announces $x$.

2. The algorithm chooses an action $a \in A$, possibly
as a function of $x$ and historical information.

3. The world announces the reward $r_a$ of action $a$.

It is critical to understand that this is not a standard
supervised learning problem, because the reward of
other actions $a' \neq a$ is not revealed.

The standard goal in this setting is to maximize the
sum of rewards $r_a$ over the rounds of interaction. In
order to do this well, it is essential to use previously
recorded events to form a good policy on the first
round of interaction. This is known as the "warm
start" problem, and is the subject of this paper. Formally,
given a dataset of the form $S = (x, a, r_a)^*$ generated
by the interaction of an uncontrolled logging
policy, we want to construct a policy $h$ maximizing
(or approximately maximizing)

$$E_{(x,r) \sim D}\left[r_{h(x)}\right].$$
Approaches that fail

There are several approaches that may appear to solve 
this problem, but turn out to be inadequate: 

1. Supervised learning. We could learn a regressor
$s : X \times A \to [0, 1]$ which is trained to predict
the reward, on observed events conditioned
on the action $a$ and other information $x$. From
this regressor, a policy is derived according to
$h(x) = \arg\max_{a \in A} s(x, a)$. A flaw of this approach
is that the argmax may extend over a set
of choices not included in the training data, and
hence may not generalize at all (or only poorly).
This can be verified by considering some extreme
cases. Suppose that there are two actions $a$ and
$b$ with action $a$ occurring $10^6$ times and action $b$
occurring $10^2$ times. Since action $b$ occurs only a
$10^{-4}$ fraction of the time, a learning algorithm
forced to trade off between predicting the expected
value of $r_a$ and $r_b$ overwhelmingly prefers
to estimate $r_a$ well at the expense of accurate
estimation for $r_b$. And yet, in application, action
$b$ may be chosen by the argmax. This problem
is only worse when action $b$ occurs zero times, as
might commonly occur in exploration situations.

2. Bandit approaches. In the standard setting these
approaches suffer from the curse of dimensionality,
because they must be applied conditioned on
$X$. In particular, applying them requires data linear
in $X \times A$, which is extraordinarily wasteful.
In essence, this is a failure to take advantage of
generalization.

3. Contextual Bandits. Existing approaches to con- 
textual bandits such as EXP4 (Auer et al., 2002) 
or Epoch Greedy (Langford & Zhang, 2008), re- 
quire either interaction to gather data or require 
knowledge of the probability the logging policy 
chose the action a. In our case the probability is 
unknown, and it may in fact always be 1. 

4. Exploration Scavenging. It is possible to recover 
exploration information from action visitation fre- 
quency when a logging policy chooses actions in- 
dependent of the input x (but possibly dependent 
on history) (Langford et al., 2008). This doesn't 
fit our setting, where the logging policy is surely 
dependent on the query. 

Our Approach 

Our approach naturally breaks down into three steps.

1. For each event $(x, a, r_a)$, estimate the probability
$\hat\pi(a|x)$ that the logging policy chooses action $a$
using regression.

2. For each event, create a synthetic controlled
contextual bandit event according to
$(x, a, r_a, 1/\max\{\hat\pi(a|x), \tau\})$ where $\tau > 0$ is some
parameter. The fourth element in this tuple,
$1/\max\{\hat\pi(a|x), \tau\}$, is an importance weight that
specifies how important the current event is
for training. The parameter $\tau$ may appear
mysterious at first, but is critical for numeric
stability.

3. Apply an offline contextual bandit algorithm to 
the set of synthetic contextual bandit events. In 
our experimental results a variant of the argmax 
regressor is used with two critical modifications: 

(a) We limit the scope of the argmax to those 
actions with positive probability. 

(b) We importance weight events so that the 
training process emphasizes good estimation 
for each action equally. 

It should be emphasized that the theoretical anal- 
ysis in this paper applies to any algorithm for 
learning on contextual bandit events — we chose 
this one because it is a simple modification on ex- 
isting (but fundamentally broken) approaches. 
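
Step 2 above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `synthetic_events`, the event tuples, and the table of step-1 estimates are all hypothetical, and the estimates are simply assumed to come from some upstream regression.

```python
def synthetic_events(events, pi_hat, tau):
    """Step 2: append the importance weight 1 / max{pi_hat(a|x), tau} to each event."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return [(x, a, r, 1.0 / max(pi_hat(a, x), tau)) for (x, a, r) in events]


# Hypothetical logged events (x, a, r_a) and hypothetical step-1 estimates.
logged = [("q1", "ad_a", 1.0), ("q1", "ad_b", 0.0), ("q2", "ad_b", 1.0)]
estimates = {("q1", "ad_a"): 0.75, ("q1", "ad_b"): 0.25, ("q2", "ad_b"): 0.01}
pi_hat = lambda a, x: estimates[(x, a)]

weighted = synthetic_events(logged, pi_hat, tau=0.05)
for event in weighted:
    print(event)
```

In this toy input the last event has an estimated probability of 0.01, below the threshold 0.05, so its weight is capped at 1/0.05 instead of growing to 1/0.01. This capping is the numeric-stability role the text attributes to the parameter.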

Three critical questions arise when considering this ap- 
proach. 

1. What does $\hat\pi(a|x)$ mean, given that the logging
policy may be deterministically choosing an action
(ad) $a$ given features $x$? The essential observation
is that a policy which deterministically
chooses action $a$ on day 1 and then deterministically
chooses action $b$ on day 2 can be treated as
randomizing between actions $a$ and $b$ with probability
0.5 when the number of events is the same
each day, and the events are IID. Thus $\hat\pi(a|x)$ is
an estimate of the expected frequency with which
action $a$ would be displayed given features $x$ over
the timespan of the logged events. In Section 3 we
show that this approach is sound in the sense that
in expectation it provides an unbiased estimate of
the value of a new policy.

2. How do the inevitable errors in $\hat\pi(a|x)$ influence
the process? It turns out they have an effect which
is dependent on $\tau$. For very small values of $\tau$, the
estimates of $\hat\pi(a|x)$ must be extremely accurate
to yield good performance while for larger values
of $\tau$ less accuracy is required. In Section 3.1, we
prove this robustness property.

3. What influence does the parameter $\tau$ have on the
final result? While creating a bias in the estimation
process, it turns out that the form of this bias
is mild and relatively reasonable: actions which
are displayed with low frequency conditioned on
$x$ effectively have an underestimated value. This
is exactly as expected for the limit where actions
have no frequency. In Section 3.1 we prove this.

We close with a generalization from policy evaluation 
to policy selection with a sample complexity bound in 
section 3.2 and then experimental results in section 4 
using a real ad dataset. 

2. Formal Problem Setup and 
Assumptions 

Let $\pi_1, \ldots, \pi_T$ be $T$ policies, where, for each $t$, $\pi_t$ is
a function mapping an input from $X$ to a (possibly
deterministic) distribution over $A$. The learning algorithm
is given a dataset of $T$ samples, each of the form
$(x, a, r_a) \in X \times A \times [0,1]$, where $(x, r)$ is drawn from
$D$ as described in Section 1, and the action $a \sim \pi_t(x)$
is chosen according to the $t$-th policy. We denote this
random process by $(x, a, r_a) \sim (D, \pi_t(\cdot|x))$. Similarly,
interaction with the $T$ policies results in a sequence
$S$ of $T$ samples, which we denote $S \sim (D, \pi_t(\cdot|x))_{t=1}^{T}$.
The learner is not given prior knowledge of the $\pi_t$.



Offline policy estimator 

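Given a dataset of the form

$$S = (x_1, a_1, r_{1,a_1}), (x_2, a_2, r_{2,a_2}), \ldots, (x_T, a_T, r_{T,a_T}) \qquad (1)$$

where $\forall t$, $x_t \in X$, $a_t \in A$, $r_{t,a_t} \in [0, 1]$, we form a
predictor $\hat\pi : X \times A \to [0,1]$ and then use it with a threshold
$\tau \in [0,1]$ to form an offline estimator for the value
of a policy $h$.

Formally, given a new policy $h : X \to A$ and a dataset
$S$, define the estimator:

$$\hat V^h_{\hat\pi}(S) = \frac{1}{|S|} \sum_{(x,a,r_a) \in S} \frac{r_a \, I(h(x) = a)}{\max\{\hat\pi(a|x), \tau\}} \qquad (2)$$

where $I(\cdot)$ denotes the indicator function.

The purpose of $\tau$ is to upper bound the individual
terms in the sum and is similar to previous methods
(Owen & Zhou, 1998).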
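
The estimator of Equation 2 translates almost directly into code. The following is a minimal sketch under stated assumptions: the name `offline_value_estimate`, the calling convention `pi_hat(a, x)`, and the toy data are illustrative choices, and the threshold is required to be strictly positive (as in Section 1) so the denominator never vanishes.

```python
def offline_value_estimate(events, h, pi_hat, tau):
    """Equation 2: average of r_a * I(h(x) = a) / max{pi_hat(a|x), tau} over S."""
    if not events:
        raise ValueError("need at least one logged event")
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    total = 0.0
    for x, a, r in events:
        if h(x) == a:  # indicator I(h(x) = a)
            total += r / max(pi_hat(a, x), tau)
    return total / len(events)


# Hypothetical log: one context, two actions, each shown half of the time.
events = [("x1", 1, 1.0), ("x1", 2, 0.0), ("x1", 1, 0.0), ("x1", 2, 1.0)]
always_one = lambda x: 1
half = lambda a, x: 0.5

small_tau = offline_value_estimate(events, always_one, half, tau=0.1)
large_tau = offline_value_estimate(events, always_one, half, tau=0.8)
```

With the small threshold the weights are the plain inverse probabilities and the estimate equals the average logged reward of action 1 in this toy log. With the large threshold every term is capped at 1/0.8, so the estimate can only shrink.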



3. Theoretical Results

We now present our algorithm and main theoretical
results. The main idea is twofold: first, we have a policy
estimation step, where we estimate the (unknown)
logging policy (analyzed in Subsection 3.1); second,
we have a policy optimization step, where we utilize
our estimated logging policy (analyzed in Subsection 3.2).
Our main result, Theorem 3.2, provides a
generalization bound, addressing the issue of how
both the estimation and optimization error contribute
to the total error.

The logging policy $\pi_t$ may be deterministic, implying
that conventional approaches relying on randomization
in the logging policy are not applicable. We show
next that this is acceptable when the world is IID and the
policy varies over its actions. We effectively substitute the
standard approach of randomization in the algorithm
for randomization in the world.

A basic claim is that the estimator is expectation
equivalent to a stochastic policy defined as follows:

$$\pi(a|x) = E_{t \sim \mathrm{UNIF}(1,\ldots,T)}\left[\pi_t(a|x)\right], \qquad (3)$$

where $\mathrm{UNIF}(\cdots)$ denotes the uniform distribution.
The stochastic policy $\pi$ picks one of the $T$ policies $\pi_t$
uniformly at random and chooses an action according to it.
Our first result is
that the expected value of our estimator is the same
when the world chooses actions according to either $\pi$
or to the sequence of policies $\pi_t$. Although this result
and its proof are straightforward, it forms the basis
for the rest of the results in our paper. Note that the
policies $\pi_t$ may be arbitrary but we have assumed that
they do not depend on the data used for evaluation.
Allowing for the offline evaluation of policies using the
same data they are trained on is an important open
problem.

Theorem 3.1. For any contextual bandit problem $D$
with identical draws over $T$ rounds, for any sequence
of possibly stochastic policies $\pi_t(a|x)$ with $\pi$ derived as
above, and for any predictor $\hat\pi$,

$$E_{S \sim (D, \pi_t(\cdot|x))_{t=1}^{T}}\left[\hat V^h_{\hat\pi}(S)\right] = E_{(x,r) \sim D,\, a \sim \pi(\cdot|x)}\left[\frac{r_a \, I(h(x) = a)}{\max\{\hat\pi(a|x), \tau\}}\right] \qquad (4)$$
This theorem relates the expected value of our estima- 
tor when T policies are used to the much simpler and 
more standard setting where a single fixed stochastic 
policy is used. 
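
Theorem 3.1 can be checked exactly on a tiny example. The sketch below is illustrative only: the two-context distribution, the two deterministic logging policies, the deliberately wrong predictor, and all names are assumptions made for the example. It uses exact rational arithmetic to compare the expectation under the sequence of policies with the expectation under the uniform mixture of Equation 3.

```python
from fractions import Fraction as F

# Hypothetical distribution D: (probability, context x, reward vector r).
D = [(F(1, 2), 0, (F(1), F(0))), (F(1, 2), 1, (F(1, 4), F(3, 4)))]
ACTIONS = (0, 1)

# Two deterministic logging policies ("day 1" always plays 0, "day 2" always plays 1).
policies = [lambda a, x: F(int(a == 0)), lambda a, x: F(int(a == 1))]
T = len(policies)


def pi_mix(a, x):
    """Equation 3: uniform mixture of the T logging policies."""
    return sum(p(a, x) for p in policies) / T


def expected_term(policy, h, pi_hat, tau):
    """E over (x, r) ~ D and a ~ policy(.|x) of r_a I(h(x) = a) / max{pi_hat(a|x), tau}."""
    return sum(
        prob * policy(a, x) * r[a] * int(h(x) == a) / max(pi_hat(a, x), tau)
        for prob, x, r in D
        for a in ACTIONS
    )


h = lambda x: x                   # policy under evaluation
pi_hat = lambda a, x: F(3, 10)    # an arbitrary (inaccurate) predictor
tau = F(1, 20)

lhs = sum(expected_term(p, h, pi_hat, tau) for p in policies) / T  # sequence of policies
rhs = expected_term(pi_mix, h, pi_hat, tau)                        # single stochastic policy
```

The two sides agree even though the predictor is inaccurate, which is the point of the theorem: it concerns only how actions are generated, not how well they are predicted.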
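Proof.

$$
\begin{aligned}
& E_{(x,r) \sim D,\, a \sim \pi(\cdot|x)}\left[\frac{r_a \, I(h(x) = a)}{\max\{\hat\pi(a|x), \tau\}}\right] \\
&= E_{(x,r) \sim D} \sum_{a} \pi(a|x) \frac{r_a \, I(h(x) = a)}{\max\{\hat\pi(a|x), \tau\}} \\
&= E_{(x,r) \sim D} \sum_{a} \frac{1}{T} \sum_{t=1}^{T} \pi_t(a|x) \frac{r_a \, I(h(x) = a)}{\max\{\hat\pi(a|x), \tau\}} \\
&= \frac{1}{T} \sum_{t=1}^{T} E_{(x,r) \sim D} \sum_{a} \pi_t(a|x) \frac{r_a \, I(h(x) = a)}{\max\{\hat\pi(a|x), \tau\}} \\
&= \frac{1}{T} \sum_{t=1}^{T} E_{(x,r) \sim D,\, a \sim \pi_t(\cdot|x)}\left[\frac{r_a \, I(h(x) = a)}{\max\{\hat\pi(a|x), \tau\}}\right] \\
&= \frac{1}{T} \sum_{t=1}^{T} E_{(x_t,r_t) \sim D,\, a_t \sim \pi_t(\cdot|x_t)}\left[\frac{r_{t,a_t} \, I(h(x_t) = a_t)}{\max\{\hat\pi(a_t|x_t), \tau\}}\right] \\
&= E_{(x,r)^T \sim D^T,\, a_t \sim \pi_t(\cdot|x_t)}\left[\frac{1}{T} \sum_{t=1}^{T} \frac{r_{t,a_t} \, I(h(x_t) = a_t)}{\max\{\hat\pi(a_t|x_t), \tau\}}\right] \\
&= E_{S \sim (D, \pi_t(\cdot|x))_{t=1}^{T}}\left[\hat V^h_{\hat\pi}(S)\right]
\end{aligned}
$$

Each equality follows from linearity of expectation, relabeling,
or the definition of expectation. The identical
draws assumption is used in the 6th equality. □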

3.1. Policy Estimation 

In this section we show that for a suitable choice of $\tau$
and $\hat\pi$ our estimator is sufficiently accurate for evaluating
new policies $h$. We aggressively use the simplification
of the previous section, which shows that we
can think of the data as generated by a fixed stochastic
policy $\pi$, i.e. $\pi_t = \pi$ for all $t$.

For a given estimate $\hat\pi$ of $\pi$ define the "regret" to be a
function $\mathrm{reg} : X \to [0, 1]$ by

$$\mathrm{reg}(x) = \max_{a \in A}\left[(\pi(a|x) - \hat\pi(a|x))^2\right]. \qquad (5)$$

Our first result is that the new estimator is consistent.
In the following theorem statement, $I(\cdot)$ denotes the
indicator function, $\pi(a|x)$ the probability that the logging
policy chooses action $a$ on input $x$, and $\hat V^h_{\hat\pi}$ our
estimator as defined by Equation 2 based on parameter
$\tau$.

Lemma 3.1. Let $\hat\pi$ be any function from $X$ to distributions
over actions $A$. Let $h : X \to A$ be any
deterministic policy. Let $V^h(x) = E_{r \sim D(\cdot|x)}[r_{h(x)}]$ denote
the expected value of executing policy $h$ on input
$x$ (so that $V^h = E_x[V^h(x)]$). We have that

$$E_x\left[I(\pi(h(x)|x) \geq \tau) \cdot \left(V^h(x) - \frac{\sqrt{\mathrm{reg}(x)}}{\tau}\right)\right] \leq E[\hat V^h_{\hat\pi}] \leq V^h + E_x\left[I(\pi(h(x)|x) \geq \tau) \cdot \frac{\sqrt{\mathrm{reg}(x)}}{\tau}\right].$$

In the above, the expectation $E[\hat V^h_{\hat\pi}]$ is taken over all
sequences of $T$ tuples $(x, a, r)$ where $(x, r) \sim D$ and
$a \sim \pi(\cdot|x)$.^1

This lemma bounds the bias in our estimate of $V^h(x)$.
There are two sources of bias: one from the error of
$\hat\pi(a|x)$ in estimating $\pi(a|x)$, and the other from threshold
$\tau$. For the first source, it's crucial that we analyze
the result in terms of the squared loss rather than (say)
$\ell_\infty$ loss, as reasonable sample complexity bounds on
the regret of squared loss estimates are achievable.

Proof. Consider a fixed $x$. Define the following quantity

$$\delta_x = \frac{\pi(h(x)|x)}{\max\{\hat\pi(h(x)|x), \tau\}} V^h(x) - V^h(x).$$

The quantity $\delta_x$ is the error of our estimator conditioned
on $x$ and satisfies $E_x[\delta_x] = E[\hat V^h_{\hat\pi}] - V^h$. Note
that

$$|\delta_x| \leq \left|\frac{\pi(h(x)|x)}{\max\{\hat\pi(h(x)|x), \tau\}} - 1\right|.$$

We consider two disjoint cases.

First, suppose that $\pi(h(x)|x) < \tau$. Then, $\delta_x$ is less
than or equal to zero, due to the max operation in the
denominator and the fact that rewards are nonnegative.
Thus, we have that $E_x[\delta_x] \leq 0$, when the expectation
is taken over the $x$ for which $\pi(h(x)|x) < \tau$. As an
aside, note that $|\delta_x|$ can have magnitude as large as
1. In other words, in this situation, the estimator may
drastically underestimate the value of policy $h$ but will
never overestimate it.

Second, suppose that $\pi(h(x)|x) \geq \tau$. Then, we have
that

$$|\delta_x| \leq \frac{\left|\pi(h(x)|x) - \max\{\hat\pi(h(x)|x), \tau\}\right|}{\max\{\hat\pi(h(x)|x), \tau\}} \leq \frac{\sqrt{\mathrm{reg}(x)}}{\tau}.$$

Expanding $\delta_x$ and taking the expectation over $x$ for
which $\pi(h(x)|x) \geq \tau$ yields the desired result. □

Corollary 3.1. Let $\hat\pi$ be any function from $X$ to distributions
over actions $A$. Let $h : X \to A$ be any
deterministic policy. If $\pi(h(x)|x) \geq \tau$ for all inputs $x$,
then

$$\left|E[\hat V^h_{\hat\pi}] - V^h\right| \leq \frac{\sqrt{E_x[\mathrm{reg}(x)]}}{\tau}. \qquad (6)$$

Proof. Follows from examining the second part of the
proof of Lemma 3.1 and applying Jensen's inequality. □

^1 Note that varying $T$ does not change the expectation
of our estimator, so $T$ has no effect in the theorem.

Lemma 3.1 shows that the expected value of our estimate
$\hat V^h_{\hat\pi}$ of a policy $h$ is an approximation to a lower
bound of the true value of the policy $h$ where the approximation
is due to errors in the estimate $\hat\pi$ and the
lower bound is due to the threshold $\tau$. When $\hat\pi = \pi$,
then the statement of Lemma 3.1 simplifies to

$$E_x\left[I(\pi(h(x)|x) \geq \tau) \cdot V^h(x)\right] \leq E[\hat V^h_{\hat\pi}] \leq V^h.$$

Thus, with a perfect predictor of $\pi$, the expected value
of the estimator $\hat V^h_{\hat\pi}$ is a guaranteed lower bound on
the true value of policy $h$. However, as the left-hand
side of this statement suggests, it may be a very loose
bound, especially if the action chosen by $h$ often has a
small probability of being chosen by $\pi$.

The dependence on $1/\tau$ in Lemma 3.1 is somewhat unsettling,
but unavoidable. Consider an instance of the
bandit problem with a single input $x$ and two actions
$a_1, a_2$. Suppose that $\pi(a_1|x) = \tau + \epsilon$ for some positive
$\epsilon$ and $h(x) = a_1$ is the policy we are evaluating. Suppose
further that the rewards are always 1 and that
$\hat\pi(a_1|x) = \tau$. Then, the estimator satisfies $E[\hat V^h_{\hat\pi}] =
\pi(a_1|x)/\hat\pi(a_1|x) = (\tau + \epsilon)/\tau$. Thus, the expected error
in the estimate is $E[\hat V^h_{\hat\pi}] - V^h = |(\tau + \epsilon)/\tau - 1| = \epsilon/\tau$,
while the regret of $\hat\pi$ is $(\pi(a_1|x) - \hat\pi(a_1|x))^2 = \epsilon^2$.
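
The example above can be worked through with concrete numbers. In this small sketch the values of the threshold and of the estimation error are hypothetical choices; exact fractions are used so that the identities hold without rounding.

```python
from fractions import Fraction as F

tau = F(1, 20)    # hypothetical threshold
eps = F(1, 100)   # hypothetical estimation error

pi_true = tau + eps   # true probability of action a_1
pi_hat = tau          # estimated probability of action a_1

# Rewards are always 1 and h(x) = a_1, so the true value of h is 1.
expected_estimate = pi_true * 1 / max(pi_hat, tau)
true_value = F(1)

bias = expected_estimate - true_value
regret = (pi_true - pi_hat) ** 2
```

Here the bias is eps/tau and the regret is eps squared, so the squared bias equals regret divided by tau squared. In other words, the error term of Lemma 3.1 (square root of the regret over the threshold) is attained with equality in this instance.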

3.2. Policy Optimization 

The previous section proves that we can effectively
evaluate a policy $h$ by observing a stochastic policy $\pi$,
as long as the actions chosen by $h$ have adequate support
under $\pi$, specifically $\pi(h(x)|x) \geq \tau$ for all inputs
$x$. However, we are often interested in choosing the
best policy $h$ from a set of policies $\mathcal{H}$ after observing
logged data. Furthermore, as described in Section 2,
the logged data are generated from $T$ fixed, possibly
deterministic, policies $\pi_1, \ldots, \pi_T$ rather than a single
stochastic policy. As in
Section 3 we define the stochastic policy $\pi$ by Equation 3,

$$\pi(a|x) = E_{t \sim \mathrm{UNIF}(1,\ldots,T)}\left[\pi_t(a|x)\right].$$

The results of Section 3.1 apply to the policy optimization
problem. However, note that the data are now
assumed to be drawn from the execution of a sequence
of $T$ policies $\pi_1, \ldots, \pi_T$, rather than by $T$ draws from
$\pi$.

Next, we show that it is possible to compete well with
the best hypothesis in $\mathcal{H}$ that has adequate support
under $\pi$ (even though the data are not generated from
$\pi$).

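Theorem 3.2. Let $\hat\pi$ be any function from $X$ to distributions
over actions $A$. Let $\mathcal{H}$ be any set of deterministic
policies. Define $\tilde{\mathcal{H}} = \{h \in \mathcal{H} \mid \pi(h(x)|x) \geq
\tau, \forall x \in X\}$ and $\tilde h = \arg\max_{h \in \tilde{\mathcal{H}}}\{V^h\}$. Let $\hat h =
\arg\max_{h \in \mathcal{H}}\{\hat V^h_{\hat\pi}\}$ be the hypothesis that maximizes the
empirical value estimator defined in Equation 2. Then,
with probability at least $1 - \delta$,

$$V^{\hat h} \geq V^{\tilde h} - \frac{2}{\tau}\left(\sqrt{E_x[\mathrm{reg}(x)]} + \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2T}}\right), \qquad (7)$$

where $\mathrm{reg}(x)$ is defined, with respect to $\hat\pi$, in Equation 5.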



Proof. First, given a dataset $(x_t, a_t, r_{t,a_t})$, $t =
1, \ldots, T$, generated by the process described in Section 2,
note that it is straightforward to apply Hoeffding's
bound (Hoeffding, 1963) to the random variables

$$X_t = \frac{r_{t,a_t} \, I(h(x_t) = a_t)}{\max\{\hat\pi(a_t|x_t), \tau\}}$$

to show that

$$\left|\hat V^h_{\hat\pi} - E[\hat V^h_{\hat\pi}]\right| \leq \frac{1}{\tau}\sqrt{\frac{\ln(2/\delta)}{2T}}$$

holds with probability at least $1 - \delta$, for a
fixed policy $h$. It is important to note here that the
$X_t$ are independent but not identical, since the action
at time $t$ is chosen according to policy $\pi_t$. The previous
argument can be made to hold for all $h \in \mathcal{H}$ by
replacing $\delta$ with $\delta/|\mathcal{H}|$ and applying the union bound.

Let $Q = (D, \pi_t(\cdot|x))_{t=1}^{T}$ be the distribution over sequences
of $T$ samples $(x, a, r_a) \in X \times A \times [0,1]$ generated
by executing the $T$ logging policies $\pi_t$ in sequence,
as described in Section 2. Let $Q' = (D, a \sim \pi(\cdot|x))$ be
the distribution over samples of the form $(x, a, r_a) \in
X \times A \times [0,1]$ such that $(x, r) \sim D$ and $a \sim \pi(\cdot|x)$.
The $T$ samples used in the estimator $\hat V^h_{\hat\pi}$ are obtained
from a single draw from $Q$.
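Now, we have that

$$
\begin{aligned}
V^{\hat h} &\geq E_{Q'}[\hat V^{\hat h}_{\hat\pi}] - E_x\left[I(\pi(\hat h(x)|x) \geq \tau) \cdot \frac{\sqrt{\mathrm{reg}(x)}}{\tau}\right] \\
&\geq E_{Q'}[\hat V^{\hat h}_{\hat\pi}] - \frac{E_x\left[\sqrt{\mathrm{reg}(x)}\right]}{\tau} \\
&\geq E_{Q'}[\hat V^{\hat h}_{\hat\pi}] - \frac{\sqrt{E_x[\mathrm{reg}(x)]}}{\tau} \\
&= E_{Q}[\hat V^{\hat h}_{\hat\pi}] - \frac{\sqrt{E_x[\mathrm{reg}(x)]}}{\tau} \\
&\geq \hat V^{\hat h}_{\hat\pi} - \frac{\sqrt{E_x[\mathrm{reg}(x)]}}{\tau} - \frac{1}{\tau}\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2T}} \\
&\geq \hat V^{\tilde h}_{\hat\pi} - \frac{\sqrt{E_x[\mathrm{reg}(x)]}}{\tau} - \frac{1}{\tau}\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2T}} \\
&\geq E_{Q}[\hat V^{\tilde h}_{\hat\pi}] - \frac{\sqrt{E_x[\mathrm{reg}(x)]}}{\tau} - \frac{2}{\tau}\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2T}} \\
&= E_{Q'}[\hat V^{\tilde h}_{\hat\pi}] - \frac{\sqrt{E_x[\mathrm{reg}(x)]}}{\tau} - \frac{2}{\tau}\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2T}} \\
&\geq V^{\tilde h} - \frac{2\sqrt{E_x[\mathrm{reg}(x)]}}{\tau} - \frac{2}{\tau}\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2T}}.
\end{aligned}
$$

The first step follows from Lemma 3.1. The second
from the fact that regret is always non-negative. The
third from an application of Jensen's inequality. The
fourth and eighth from an application of Theorem 3.1.
The fifth and seventh from an application of Hoeffding's
bound as detailed above. The sixth from the definition
of $\hat h$. The final step follows from Corollary 3.1
and observing that $\tilde h \in \tilde{\mathcal{H}}$. □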

The proof of Theorem 3.2 relies on the lower-bound
property of our estimator (the left-hand side of the
inequality stated in Lemma 3.1). In other words, if $\mathcal{H}$
contains a very good policy that has little support under
$\pi$, we will not be able to detect that by our estimator.
On the other hand, our estimation is safe in
the sense that we will never drastically overestimate
the value of any policy in $\mathcal{H}$. This "underestimate,
but don't overestimate" property is critical to the application
of optimization techniques, as it implies we
can use an unrestrained learning algorithm to derive a
warm start policy.
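
To get a feel for how the guarantee of Theorem 3.2 scales, the term subtracted on the right-hand side of Equation 7 can be evaluated numerically. This is a sketch under stated assumptions: the function name `theorem_3_2_slack` and all input values below are hypothetical and chosen only to illustrate the dependence on the threshold and the sample count.

```python
import math


def theorem_3_2_slack(tau, expected_regret, num_policies, num_samples, delta):
    """(2 / tau) * (sqrt(E_x[reg(x)]) + sqrt(ln(2|H| / delta) / (2T)))."""
    estimation = math.sqrt(expected_regret)
    concentration = math.sqrt(math.log(2 * num_policies / delta) / (2 * num_samples))
    return (2.0 / tau) * (estimation + concentration)


# Hypothetical settings: a perfect predictor (zero regret) and 1000 candidate policies.
slack_small_sample = theorem_3_2_slack(0.05, 0.0, 1000, 10**6, 0.05)
slack_large_sample = theorem_3_2_slack(0.05, 0.0, 1000, 10**7, 0.05)
slack_small_tau = theorem_3_2_slack(0.01, 0.0, 1000, 10**6, 0.05)
```

More samples tighten the bound, while a smaller threshold loosens it in proportion to its inverse. A smaller threshold also enlarges the set of policies with adequate support, so the two effects pull in opposite directions.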

4. Empirical Evaluation 

We evaluated our method on a real-world Internet ad- 
vertising dataset. We have obtained proprietary data 
from an online advertising company, covering a period 
of approximately one month. The data consist
of logs of events (x, a, y), where each event represents
a visit by a user to a particular web page x, from a set 
of web pages X. From a large set of advertisements 
A, the commercial system chooses a single ad a for the 
topmost, or most prominent position. It also chooses 
additional ads to display, but these were ignored in our 
test. The output y is an indicator of whether the user 
clicked on the ad or not. 

The total number of ads in the data set is approximately
880,000. The training data consist of 35 million
events. The test data contain 19 million events
occurring after the events in the training data. The 
total number of distinct web pages is approximately 
3.4 million. 

We trained a policy h to choose an ad, based on the 
current page, to maximize the probability of click. For 
the purposes of learning, each ad and page was repre- 
sented internally as a sparse high-dimensional feature 
vector. The features correspond to the words that 
appear in the page or ad, weighted by the frequency 
with which they appear. Each ad contains, on aver- 
age, 30 ad features and each page, approximately 50 
page features. The particular form of $f$ was linear
over all features of its input $(x, a)$, which is a sparse
high-dimensional feature vector representing the combination
of the page and ad.^2 For instance, every pair
of possible words had a corresponding feature. For example,
given the two words "apple" and "ipod", the
corresponding feature "apple-ipod" has a value of 0.25
when the first word, "apple", appeared in the page $x$
with frequency 0.5 and the second word, "ipod", appeared
in the ad $a$ with frequency 0.5.
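
The word-pair features described above can be sketched as follows. The helper names `pair_features` and `linear_score`, the dictionary representation, and the extra words "pie" and "nano" are hypothetical; only the "apple-ipod" value of 0.25 comes from the text.

```python
def pair_features(page_words, ad_words):
    """One feature per (page word, ad word) pair, valued by the product of frequencies."""
    return {
        f"{page_word}-{ad_word}": page_freq * ad_freq
        for page_word, page_freq in page_words.items()
        for ad_word, ad_freq in ad_words.items()
    }


def linear_score(weights, features):
    """A linear function of the sparse features; missing weights count as zero."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


page = {"apple": 0.5, "pie": 0.5}   # word frequencies in the page x
ad = {"ipod": 0.5, "nano": 0.5}     # word frequencies in the ad a

features = pair_features(page, ad)
score = linear_score({"apple-ipod": 2.0}, features)
```

Storing only the nonzero pairs keeps the representation sparse even though the full space of word pairs is very large.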

Using all the data, we modeled the logging policy using
simple empirical estimation:

$$\hat\pi(a|x) = \frac{|\{t \mid (a_t = a) \wedge (x_t = x)\}|}{|\{t \mid x_t = x\}|}. \qquad (8)$$

In words, for each page and ad pair $(x, a)$, we computed
the fraction of the visits to page $x$ in the data
on which ad $a$ appeared. The decision to use all of the data requires
careful consideration. Some alternatives to consider
are:

1. Training data only. Since the set of ads changes 
over time, many ads appearing in the test data 
do not occur at all in the training data. Con- 
sequently, reliably predicting the performance on 
test data is problematic. 

2. Training data for training set and test data for
test set. This approach has an inherent bias towards
incorrectly high scores on the test set. In an
extreme case, suppose that only one ad appears
on a (rare) webpage in the test set. Then, any
policy selecting from amongst the set of appearing
ads must select this ad.

3. All data. This approach means that policies must
generally select from a larger set of ads than are
available at any moment in time for the live system,
implying that the policy evaluation is generally
pessimistic. Note that the logging policy in
contrast is optimistically evaluated, because the
set of test-time available ads is smaller than the
set of ads available over both test-time and train-time
ads, implying the frequency estimates for
test-time ads on the train+test dataset are generally
smaller than an estimate using just test-time
ads.^3 These smaller-than-necessary frequency estimates
imply that the logging policy evaluation
is optimistic since events are weighted by the inverse
frequency. Consequently, this choice provides
a conservative estimate for new policies and
an optimistic choice for the older (logging) policy.

^2 Technically the feature vector that the regressor uses
is the Cartesian product of the page and ad vectors.

The particular policy that was optimized had an
argmax form: $h(x) = \arg\max_{a \in C(x)}\{f(x, a)\}$, with
a crucial distinction from previous approaches in how
$f(x, a)$ was trained. Here $f : X \times A \to [0, 1]$ is a regression
function that is trained to estimate probability of
click, and $C(x) = \{a \in A \mid \hat\pi(a|x) > 0\}$ is a set of
feasible ads.
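
The empirical estimate of Equation 8 and the restricted argmax can be sketched together. This is an illustrative sketch, not the production system: the names `fit_logging_policy` and `argmax_policy`, the tie-breaking by sorted order, the `None` returned for unseen pages, and the toy scores are all assumptions.

```python
from collections import Counter


def fit_logging_policy(events):
    """Equation 8: pi_hat(a|x) = #(x, a) / #(x), plus C(x) = {a : pi_hat(a|x) > 0}."""
    pair_counts = Counter((x, a) for x, a, _ in events)
    page_counts = Counter(x for x, _, _ in events)
    feasible = {}
    for x, a in pair_counts:
        feasible.setdefault(x, set()).add(a)

    def pi_hat(a, x):
        return pair_counts[(x, a)] / page_counts[x] if page_counts[x] else 0.0

    return pi_hat, feasible


def argmax_policy(f, feasible):
    """h(x) = argmax over a in C(x) of f(x, a); None for a page that was never logged."""
    def h(x):
        candidates = sorted(feasible.get(x, ()))
        return max(candidates, key=lambda a: f(x, a)) if candidates else None
    return h


# Hypothetical log of (page, ad, click) events and a hypothetical click regressor.
events = [("p1", "a1", 0), ("p1", "a2", 1), ("p1", "a1", 0), ("p2", "a3", 1)]
scores = {"a1": 0.1, "a2": 0.2, "a3": 0.9}
f = lambda x, a: scores[a]

pi_hat, feasible = fit_logging_policy(events)
h = argmax_policy(f, feasible)
```

On page "p1" the regressor scores ad "a3" highest, but "a3" was never shown there, so the restricted argmax falls back to "a2". An unrestricted argmax would have chosen the unexplored ad.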

The training samples were of the form $(x, a, y)$, where
$y = 1$ if the ad $a$ was clicked after being shown on
page $x$ or $y = 0$ if it wasn't clicked. The regressor
$f$ was chosen to approximately minimize the weighted
squared loss: $\frac{(y - f(x, a))^2}{\max\{\hat\pi(a|x), \tau\}}$.

Stochastic gradient descent was used to minimize the
squared loss on the training data.
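
A minimal sketch of stochastic gradient descent on an importance-weighted squared loss for a linear regressor is shown below. It assumes each sample carries a precomputed weight (for instance the importance weight from step 2 of the approach); the function name, the sparse feature encoding, and the toy data are hypothetical and do not reproduce the authors' training code.

```python
def sgd_weighted_squared_loss(samples, num_features, learning_rate=0.1, passes=1):
    """Minimize sum of weight * (y - w.x)^2 by SGD.

    samples: list of (features, y, weight), with features a dict {index: value}.
    """
    w = [0.0] * num_features
    for _ in range(passes):
        for features, y, weight in samples:
            prediction = sum(w[i] * v for i, v in features.items())
            # Gradient of weight * (prediction - y)^2 with respect to the prediction.
            grad_scale = 2.0 * weight * (prediction - y)
            for i, v in features.items():
                w[i] -= learning_rate * grad_scale * v
    return w


# Two conflicting hypothetical samples on the same feature: the heavier one dominates.
conflicting = [({0: 1.0}, 1.0, 4.0), ({0: 1.0}, 0.0, 1.0)]
w_conflict = sgd_weighted_squared_loss(conflicting, 1, learning_rate=0.01, passes=200)

consistent = [({0: 1.0}, 1.0, 1.0)] * 50
w_consistent = sgd_weighted_squared_loss(consistent, 1, learning_rate=0.1)
```

In the conflicting case the weighted least-squares solution is 4/5, and the iterate settles close to it. This shows how the weights shift the fit toward events the logging policy rarely produced.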

During the evaluation, we computed the estimator on
the test data $(x_t, a_t, y_t)$:

$$\hat V^h_{\hat\pi} = \frac{1}{T} \sum_{t=1}^{T} \frac{y_t \, I(h(x_t) = a_t)}{\max\{\hat\pi(a_t|x_t), \tau\}}. \qquad (9)$$

As mentioned in the introduction, this estimator is
biased due to the use of the parameter $\tau > 0$. As
shown in the analysis of Section 3, this bias typically
results in an underestimate of the true value of the
policy $h$.

We experimented with different thresholds $\tau$ and parameters
of our learning algorithm.^4

^3 As an extreme example, suppose we log data for two
days and we use the first day for training and the second
day for testing. Suppose that only a single ad $a_1$ appears
in the train set, and a single ad $a_2$ appears in the test set,
due to the fact that the budget for ad $a_1$ ran out after the
first day. Our empirical estimate of $\hat\pi(a_2|x)$ on the test
set used in the denominator of our estimator (Equation 8)
will be 1/2. In fact the true probability of $a_2$ on the test
set is 1. Thus, the value of the logging policy will be
overestimated by a factor of 2. Suppose further that ad $a_1$
is indeed better than $a_2$. The evaluation of a policy that
always chooses the better ad, $a_1$, using Equation 8 will be
zero, a drastic underestimate of its true value.

4.1. Results 



Method    $\tau$    Estimate   Interval
Learned   0.01      0.0193     [0.0187, 0.0206]
Random    0.01      0.0154     [0.0149, 0.0166]
Learned   0.05      0.0132     [0.0129, 0.0137]
Random    0.05      0.0111     [0.0109, 0.0116]
Naive     0.05      0.0        [0, 0.0071]


The Interval column is computed using the relative entropy
form of the Chernoff bound with $\delta = 0.05$ which
holds under the assumption that variables, in our case
the samples used in the computation of the estimator
(Equation 9), are IID. Note that this computation is
slightly complicated because the range of the variables
is $[0, 1/\tau]$ rather than $[0, 1]$ as is typical. This is handled
by rescaling by $\tau$, applying the bound, and then
rescaling the results by $1/\tau$.
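
The rescaling recipe can be sketched as follows. This is one possible reading, not the authors' code: it assumes a two-sided interval obtained by inverting the relative entropy bound with a budget of ln(2/delta)/n and solves for the endpoints by bisection. The function names are hypothetical, and the text does not say whether a one-sided or two-sided form was used, so the numbers need not match the table.

```python
import math


def bernoulli_kl(p, q):
    """Relative entropy KL(p || q) between Bernoulli(p) and Bernoulli(q)."""
    q = min(max(q, 1e-12), 1.0 - 1e-12)
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def kl_interval(mean, n, delta):
    """All q in [0, 1] with KL(mean || q) <= ln(2 / delta) / n, found by bisection."""
    budget = math.log(2.0 / delta) / n
    lo, hi = mean, 1.0                      # search for the upper endpoint
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    upper = lo
    lo, hi = 0.0, mean                      # search for the lower endpoint
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= budget:
            hi = mid
        else:
            lo = mid
    return hi, upper


def rescaled_interval(estimate, n, tau, delta=0.05):
    """Rescale by tau into [0, 1], apply the bound, then rescale by 1 / tau."""
    lower, upper = kl_interval(estimate * tau, n, delta)
    return lower / tau, upper / tau


lower, upper = rescaled_interval(0.0193, 19_000_000, 0.01)
```

The sample count used here is the test-set size quoted in the text; the resulting interval brackets the point estimate and widens as the sample count drops.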

The "Random" policy is the policy that chooses randomly
from the set of feasible ads: $\mathrm{Random}(x) = a \sim
\mathrm{UNIF}(C(x))$, where $\mathrm{UNIF}(\cdot)$ denotes the uniform
distribution.

The "Naive" policy corresponds to the theoretically
flawed supervised learning approach detailed in the
introduction. The evaluation of this policy is quite
expensive, requiring one evaluation per ad per example,
so the size of the test set is reduced to 8373 examples
with a click, which reduces the significance of the
results. We bias the results towards the naive policy by
choosing the chronologically first events in the test set
(i.e. the events most similar to those in the training
set). Nevertheless, the naive policy receives 0 reward,
which is significantly less than all other approaches. A
possible fear with the evaluation here is that the naive
policy is always finding good ads that simply weren't
explored. A quick check shows that this is not correct:
the naive argmax simply makes implausible choices.
Note that we report only evaluation against $\tau = 0.05$,
as the evaluation against $\tau = 0.01$ is not significant,
although the reward obviously remains 0.

^4 For stochastic gradient descent, we varied the learning
rate over 5 fixed numbers (0.2, 0.1, 0.05, 0.02, 0.01) using 1
pass over the data. We report on the test results for the
value with the best training error.

The "Learned" policies do depend on $\tau$. As suggested
by Theorem 3.2, as $\tau$ is decreased, the effective set
of hypotheses we compete with is increased, thus allowing
for better performance of the learned policy.
Indeed, the estimates for both the learned policy and
the random policy improve when we decrease $\tau$ from
0.05 to 0.01.

The empirical click-through rate on the test set was 
0.0213, which is slightly larger than the estimate for 
the best learned policy. However, this number is not 
directly comparable since the estimator provides a 
lower bound on the true value of the policy due to 
the bias introduced by a nonzero $\tau$ and because any
deployed policy chooses from only the set of ads which 
are available to display rather than the set of all ads 
which might have been displayable at other points in 
time. 

The empirical results are generally consistent with the
theoretical approach outlined here: they provide a
consistently pessimistic estimate of policy value which
nevertheless has sufficient dynamic range to distinguish
learned policies from random policies, learned
policies over larger spaces (smaller $\tau$) from smaller
spaces (larger $\tau$), and the theoretically unsound
naive approach from sounder approaches which choose
amongst the explored space of ads.

5. Conclusion 

We stated, justified, and evaluated theoretically and 
empirically the first method for solving the warm start 
problem for exploration from logged data with con- 
trolled bias and estimation. This problem is of obvi- 
ous interest to applications for internet companies that 
recommend content (such as ads, search results, news
stories, etc.) to users.

However, we believe this also may be of interest 
for other application domains within machine learn- 
ing. For example, in reinforcement learning, the stan- 
dard approach to offline policy evaluation is based 
on importance weighted samples (Kearns et al., 2000; 
Precup et al., 2000). The basic results stated here 
could be applied to RL settings, eliminating the need 
to know the probability of a chosen action explicitly, 
allowing an RL agent to learn from external observa- 
tions of other agents. 

The main restrictive assumption adopted by the Exploration
Scavenging paper (Langford et al., 2008) is
that the logging policy chooses actions independently
of the input. We have introduced a new method
that works when this assumption is violated. On the
other hand, we have required the logging policy be
a sequence of fixed, possibly deterministic, policies,
whereas the Exploration Scavenging paper allowed for
the use of logging policies that learn and adapt over
time. An interesting situation occurs when you allow
$\pi_t$ to depend on the history up to time $t$. In this
setting the policy may both adapt (like in the Exploration
Scavenging paper) and choose actions dependent
on the current input. Is there an offline policy
estimator which can work in this setting? The most
generic answer is no, but there may exist some natural
constraint which encapsulates the approach discussed
here, as well as in the earlier paper.